Air Quality in LA

Introduction to R & Exploratory Data Visualisation

Preliminaries

To get started with this course, you will need R. I strongly recommend RStudio. To function correctly, RStudio needs R, and therefore both need to be installed on your computer. If you already have these installed, please see this page to make sure you have up-to-date versions of the software. We will talk about other required software as we proceed.

Why R?

  • Great for reproducibility
  • Great for statistics
  • Great for data analysis and visualization
  • Interdisciplinary & Extensible
  • Open source & Cross Platform
  • Most GIS operations can be done in R

See R-ecology for more details.

Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn’t equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness. – Patrick Burns: R-help (April 2005)

Why not Python? Or Julia? Or C++?

Mostly because I don’t know them well, or at all. They are probably great, but R works for me, for the most part. See Sebastian’s post on Why Python?.

Good Practices

  • Keep track of who wrote what code at the top of the file and its purpose.
  • Use comments liberally using “#”.
  • Use setwd() only at the top of the script.
  • Distinct components of the code should ideally be separate and be accessed using source(). This improves legibility.
  • Use a consistent style within your code. For example, name all dataframes to something ending in _df.
  • Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them.
  • Memory can be a problem for R, because it stores all objects in memory (mostly true). Follow the good practice of deleting objects you no longer need with rm(). It is rarely necessary to invoke garbage collection with gc(). If you constantly run into memory problems:
    • Consider upgrading RAM (It is becoming cheaper by the day).
    • Consider if you can break up the data set and independently process them and reassemble them.
    • Find alternative libraries and methods.
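For instance, a minimal sketch of this clean-up practice (the object name big_df is purely illustrative):

```r
big_df <- data.frame(x = runif(1e6))      # a large intermediate object
print(object.size(big_df), units = "Mb")  # check how much memory it occupies
rm(big_df)                                # delete it once it is no longer needed
gc()                                      # rarely necessary; R usually reclaims memory on its own
```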

Basic R

Once you have R & RStudio installed, run the following to see if everything is working correctly.

1 + 1   #add 1 + 1
# [1] 2
a <- 2:5  #Assign the vector 2:5 to a variable a
a       #print a to the console
# [1] 2 3 4 5
runif(5)   #Generate 5 random numbers
# [1] 0.41061014 0.79963303 0.89764367 0.58700535 0.08944311
sessionInfo()     #What is my computer environment? (Mac, Windows, libraries etc.)
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.6
# 
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] here_0.1    rgdal_1.3-4 sp_1.3-1    tigris_0.7 
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.18     pillar_1.3.0     compiler_3.5.1   bindr_0.1.1     
#  [5] class_7.3-14     tools_3.5.1      uuid_0.1-2       digest_0.6.16   
#  [9] evaluate_0.11    tibble_1.4.2     lattice_0.20-35  pkgconfig_2.0.2 
# [13] rlang_0.2.2      DBI_1.0.0        curl_3.2         yaml_2.2.0      
# [17] blogdown_0.8     xfun_0.3         bindrcpp_0.2.2   spData_0.2.9.3  
# [21] e1071_1.7-0      dplyr_0.7.6      stringr_1.3.1    httr_1.3.1      
# [25] knitr_1.20       rappdirs_0.3.1   classInt_0.2-3   rprojroot_1.3-2 
# [29] grid_3.5.1       tidyselect_0.2.4 glue_1.3.0       sf_0.6-3        
# [33] R6_2.2.2         foreign_0.8-70   rmarkdown_1.10   bookdown_0.7    
# [37] purrr_0.2.5      magrittr_1.5     units_0.6-0      backports_1.1.2 
# [41] htmltools_0.3.6  maptools_0.9-3   assertthat_0.2.0 stringi_1.2.4   
# [45] crayon_1.3.4

The output of the last two commands will be different on your computer than what is being shown.


Exercise

  • If a vector a is (4, 6, 5, 7, 10, 9, 4, 15), how many elements are over 7? What positions in the vector are they in?
  • If b is 3, what is a+b? If b is (2,5), what is a+b?

Literate programming and RMarkdown

Literate Programming is a concept introduced by Donald Knuth. The fundamental premise is that one should write code that communicates primarily to humans, not computers.

RMarkdown allows you to easily write documents that contain R code, text, images, links, etc. You are required to use it for this course, and you should use it effectively as a lab notebook for your data analysis everywhere. This and the other tutorials in this sequence are written in RMarkdown.

Getting started with Markdown

To start a new Markdown file, open up RStudio and follow the steps shown in the picture below.

The created template is quite informative. Use Knit on the top to see the output of the template.

The parameters between the two --- lines form a YAML header. YAML (YAML Ain’t Markup Language) is a human-readable data serialization language. Most of the document rendering settings live there; experiment with them. By simply changing the output format, we can get a doc file or an HTML file.
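For example, a minimal YAML header might look like the following (the title, author and date values are placeholders):

```yaml
---
title: "Air Quality in LA"
author: "Your Name"
date: "2018-09-01"
output: html_document  # change to word_document to render a doc file
---
```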

The rest of the document is formatted according to Markdown syntax. Below is some useful syntax to structure your documents.

# Heading 1
## Heading 2


*italics*

**bold**

~~strikethrough~~

~subscript~

^superscript^

![](pictures/notebook.png)

[link](www.link.com)

$E = mc^2$

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

Other syntax is documented on the RStudio website.

R Markdown

To include R code in your file you can insert a code chunk with

  • the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I)
  • the Add Chunk command in the editor toolbar
  • or by typing the chunk delimiters ```{r} and ```

Chunk output can be customized with knitr options, arguments set in the {} of a chunk header. The main ones are include, cache, echo, message, warning and fig.cap.
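For instance, a chunk header combining several of these options might look like the following (the chunk label and figure caption are illustrative):

````markdown
```{r cars-summary, echo=TRUE, message=FALSE, warning=FALSE, cache=TRUE, fig.cap="Stopping distance vs. speed"}
plot(cars)
```
````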

Here is an example

summary(cars) #cars is a dataset included in base R
#      speed           dist       
#  Min.   : 4.0   Min.   :  2.00  
#  1st Qu.:12.0   1st Qu.: 26.00  
#  Median :15.0   Median : 36.00  
#  Mean   :15.4   Mean   : 42.98  
#  3rd Qu.:19.0   3rd Qu.: 56.00  
#  Max.   :25.0   Max.   :120.00

When markdown files get big, it is often useful to cache the output of code chunks using the cache option.


Exercise

  • Extend the template created when starting a new RMarkdown file with one more R command and one more graphic based on the cars dataset

Getting started with real data

In this section, we are going to summarise the Air Quality Index for about 4,000 stations in the US in 2017. These datasets are downloaded from the EPA website. A local copy of the dataset is here

library(tidyverse)
ozone_airq_df <- read_csv("./datasets/1.introtoR/daily_44201_2017.csv") #Change file paths accordingly 
dim(ozone_airq_df)
# [1] 402501     29
names(ozone_airq_df)
#  [1] "State Code"          "County Code"         "Site Num"           
#  [4] "Parameter Code"      "POC"                 "Latitude"           
#  [7] "Longitude"           "Datum"               "Parameter Name"     
# [10] "Sample Duration"     "Pollutant Standard"  "Date Local"         
# [13] "Units of Measure"    "Event Type"          "Observation Count"  
# [16] "Observation Percent" "Arithmetic Mean"     "1st Max Value"      
# [19] "1st Max Hour"        "AQI"                 "Method Code"        
# [22] "Method Name"         "Local Site Name"     "Address"            
# [25] "State Name"          "County Name"         "City Name"          
# [28] "CBSA Name"           "Date of Last Change"

If the library(.) command returns an error, it is likely that you do not have the package . installed on your computer. Install it with, e.g., install.packages("tidyverse"). You only need to install packages once, but you need to load them with library() in every R session.
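As a sketch, installation happens once per computer, while loading happens in every session:

```r
install.packages("tidyverse")  # run once; note the quotes around the package name
library(tidyverse)             # run in every new R session
```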

Zhu Wang: I am trying to create a library which uses some Fortran source files […] Douglas Bates: Someone named Martin Maechler will shortly be sending you email regarding the distinction between ‘library’ and ‘package’ :-) – Zhu Wang and Douglas Bates: R-help (May 2004)

In the US, states and counties are uniquely identified by 2- and 3-digit FIPS (Federal Information Processing Standards) codes. Let’s create a combined code so that we can merge datasets together.

ozone_airq_df$stcofips <- paste(formatC(ozone_airq_df$`State Code`, width = 2, flag = "0"),
                                formatC(ozone_airq_df$`County Code`, width = 3, flag = "0"),
                                sep = "")

head(ozone_airq_df$stcofips) #Inspect the outcome
# [1] "01003" "01003" "01003" "01003" "01003" "01003"
tail(ozone_airq_df$stcofips)
# [1] "80026" "80026" "80026" "80026" "80026" "80026"
unique(nchar(ozone_airq_df$stcofips)) #Making sure all fips codes are of length 5 characters
# [1] 5

What we really need is only a few columns, so we extract them.

ozone_airq_df2 <- ozone_airq_df[,c('stcofips', 'Site Num', 'Latitude', 'Longitude', 'Date Local', 'State Name', 'County Name', 'CBSA Name', 'AQI')]
summary(ozone_airq_df2)
#    stcofips           Site Num            Latitude       Longitude      
#  Length:402501      Length:402501      Min.   :18.18   Min.   :-158.09  
#  Class :character   Class :character   1st Qu.:34.10   1st Qu.:-110.10  
#  Mode  :character   Mode  :character   Median :38.34   Median : -91.53  
#                                        Mean   :37.64   Mean   : -95.33  
#                                        3rd Qu.:41.06   3rd Qu.: -82.17  
#                                        Max.   :64.85   Max.   : -65.92  
#    Date Local          State Name        County Name       
#  Min.   :2017-01-01   Length:402501      Length:402501     
#  1st Qu.:2017-04-12   Class :character   Class :character  
#  Median :2017-06-30   Mode  :character   Mode  :character  
#  Mean   :2017-06-30                                        
#  3rd Qu.:2017-09-19                                        
#  Max.   :2017-12-31                                        
#   CBSA Name              AQI        
#  Length:402501      Min.   :  0.00  
#  Class :character   1st Qu.: 31.00  
#  Mode  :character   Median : 38.00  
#                     Mean   : 40.13  
#                     3rd Qu.: 45.00  
#                     Max.   :233.00
nrow(unique(ozone_airq_df2[,c('Latitude', 'Longitude')]))
# [1] 1265
length(unique(ozone_airq_df2$stcofips))
# [1] 786

i.e. there are 1,265 unique monitoring station locations in 786 counties, roughly 1.61 per county on average. However, there are about 3,100 counties in the US, suggesting that large parts of the US are not covered by the monitoring stations.

You can also follow the tidyverse syntax using select.

ozone_airq_df3 <- select(ozone_airq_df, `stcofips`, `Site Num`, `Latitude`, `Longitude`, `Date Local`, `State Name`, `County Name`, `CBSA Name`, `AQI`) #We only needed `` because of spaces in column names. 
all.equal(ozone_airq_df2, ozone_airq_df3)
# [1] TRUE
rm(ozone_airq_df3)

select is more powerful than [ because it has helper functions that select columns based on pattern matching; for example, select(ozone_airq_df, ends_with("Name")) would select the County Name, State Name and CBSA Name columns.

The spaces in the column names are problematic, so let’s rename the columns.

ozone_airq_df2 <- rename(ozone_airq_df2, 
                 County_Name = `County Name` ,  State_Name = `State Name` ,  CBSA_Name = `CBSA Name`, dateL = `Date Local`, Site_Num = `Site Num`
                 )

Order, Mutate, Group, Filter: Work horses of data manipulation

Order

This dataset has repeated observations per monitoring station. It would be nice if they were rearranged in a nested fashion, i.e. State | County | Monitoring Station | Date

select(arrange(ozone_airq_df2, stcofips, desc(dateL)), stcofips, dateL, AQI) #select is only to demonstrate
# # A tibble: 402,501 x 3
#    stcofips dateL        AQI
#    <chr>    <date>     <int>
#  1 01003    2017-10-31    45
#  2 01003    2017-10-30    38
#  3 01003    2017-10-28    24
#  4 01003    2017-10-27    32
#  5 01003    2017-10-26    43
#  6 01003    2017-10-25    35
#  7 01003    2017-10-24    36
#  8 01003    2017-10-23    31
#  9 01003    2017-10-22    27
# 10 01003    2017-10-21    29
# # ... with 402,491 more rows

We take advantage of the fact that FIPS codes are arranged in a natural order. Otherwise, we could also do

select(arrange(ozone_airq_df2, State_Name, County_Name, desc(dateL)), stcofips, dateL, AQI) #select is only to demonstrate
# # A tibble: 402,501 x 3
#    stcofips dateL        AQI
#    <chr>    <date>     <int>
#  1 01003    2017-10-31    45
#  2 01003    2017-10-30    38
#  3 01003    2017-10-28    24
#  4 01003    2017-10-27    32
#  5 01003    2017-10-26    43
#  6 01003    2017-10-25    35
#  7 01003    2017-10-24    36
#  8 01003    2017-10-23    31
#  9 01003    2017-10-22    27
# 10 01003    2017-10-21    29
# # ... with 402,491 more rows

Exercise

  • Print the top 10 AQI values and their corresponding sites and dates

Mutate

Mutate creates new variables out of old ones. The key functions are mutate and the mutate_* variants; the mutate_* functions let us apply the same function to multiple selected columns. You’ll want to use mutate just about any time you want to create a new variable in your data frame, or alter and overwrite an existing variable.

# Convert Longitude and Latitude to radians
mutate(ozone_airq_df2, 
                Latitude_rad = pi * Latitude / 180,
                Longitude_rad = pi * Longitude / 180) %>%
  select(contains('tude'))
# # A tibble: 402,501 x 4
#    Latitude Longitude Latitude_rad Longitude_rad
#       <dbl>     <dbl>        <dbl>         <dbl>
#  1     30.5     -87.9        0.532         -1.53
#  2     30.5     -87.9        0.532         -1.53
#  3     30.5     -87.9        0.532         -1.53
#  4     30.5     -87.9        0.532         -1.53
#  5     30.5     -87.9        0.532         -1.53
#  6     30.5     -87.9        0.532         -1.53
#  7     30.5     -87.9        0.532         -1.53
#  8     30.5     -87.9        0.532         -1.53
#  9     30.5     -87.9        0.532         -1.53
# 10     30.5     -87.9        0.532         -1.53
# # ... with 402,491 more rows


# or do this.

mutate_at(ozone_airq_df2, vars(contains("tude")), funs(rad = . /180 * pi) ) %>% select(contains('tude'))
# # A tibble: 402,501 x 4
#    Latitude Longitude Latitude_rad Longitude_rad
#       <dbl>     <dbl>        <dbl>         <dbl>
#  1     30.5     -87.9        0.532         -1.53
#  2     30.5     -87.9        0.532         -1.53
#  3     30.5     -87.9        0.532         -1.53
#  4     30.5     -87.9        0.532         -1.53
#  5     30.5     -87.9        0.532         -1.53
#  6     30.5     -87.9        0.532         -1.53
#  7     30.5     -87.9        0.532         -1.53
#  8     30.5     -87.9        0.532         -1.53
#  9     30.5     -87.9        0.532         -1.53
# 10     30.5     -87.9        0.532         -1.53
# # ... with 402,491 more rows

A useful thing to do is to create a unique id for each monitoring station. We can accomplish it using mutate.

ozone_airq_df2 <- mutate(ozone_airq_df2, SiteID = paste(stcofips, Site_Num, sep="_")) 

Exercise

  • Calculate the distance to the center of the US from each site. The spherical law of cosines formula for great-circle distances is acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(long2-long1)) * 6371 in km, with coordinates in radians. Assume the center of the US is at 39.8333333, -98.585522 (in Kansas)

Filter

Sometimes, instead of subsetting columns, you want to subset rows. filter is a useful command for this.

filter(ozone_airq_df2, stcofips == "06037") #Los Angeles County
# # A tibble: 4,574 x 10
#    stcofips Site_Num Latitude Longitude dateL      State_Name County_Name
#    <chr>    <chr>       <dbl>     <dbl> <date>     <chr>      <chr>      
#  1 06037    0002         34.1     -118. 2017-01-01 California Los Angeles
#  2 06037    0002         34.1     -118. 2017-01-02 California Los Angeles
#  3 06037    0002         34.1     -118. 2017-01-03 California Los Angeles
#  4 06037    0002         34.1     -118. 2017-01-04 California Los Angeles
#  5 06037    0002         34.1     -118. 2017-01-05 California Los Angeles
#  6 06037    0002         34.1     -118. 2017-01-06 California Los Angeles
#  7 06037    0002         34.1     -118. 2017-01-07 California Los Angeles
#  8 06037    0002         34.1     -118. 2017-01-08 California Los Angeles
#  9 06037    0002         34.1     -118. 2017-01-09 California Los Angeles
# 10 06037    0002         34.1     -118. 2017-01-10 California Los Angeles
# # ... with 4,564 more rows, and 3 more variables: CBSA_Name <chr>,
# #   AQI <int>, SiteID <chr>

Group_by

Even though we rearranged the data, we have not yet split it to calculate group summaries. For example, if you wanted to find out which monitoring stations had readings for fewer than 30 days, you could do the following.

group_by(ozone_airq_df2, SiteID) %>%
  filter(n() < 30)
# # A tibble: 43 x 10
# # Groups:   SiteID [4]
#    stcofips Site_Num Latitude Longitude dateL      State_Name County_Name
#    <chr>    <chr>       <dbl>     <dbl> <date>     <chr>      <chr>      
#  1 17083    0117         39.1     -90.3 2017-10-03 Illinois   Jersey     
#  2 17083    0117         39.1     -90.3 2017-10-04 Illinois   Jersey     
#  3 17083    0117         39.1     -90.3 2017-10-05 Illinois   Jersey     
#  4 17083    0117         39.1     -90.3 2017-10-06 Illinois   Jersey     
#  5 17083    0117         39.1     -90.3 2017-10-07 Illinois   Jersey     
#  6 17083    0117         39.1     -90.3 2017-10-08 Illinois   Jersey     
#  7 17083    0117         39.1     -90.3 2017-10-09 Illinois   Jersey     
#  8 17083    0117         39.1     -90.3 2017-10-10 Illinois   Jersey     
#  9 17083    0117         39.1     -90.3 2017-10-11 Illinois   Jersey     
# 10 17083    0117         39.1     -90.3 2017-10-12 Illinois   Jersey     
# # ... with 33 more rows, and 3 more variables: CBSA_Name <chr>, AQI <int>,
# #   SiteID <chr>

Group_by is quite powerful for calculating within-group summaries, and can be combined with mutate to change each row. For example, we can calculate the maximum AQI across all the monitors in a county for each day

group_by(ozone_airq_df2, stcofips, dateL) %>%
  summarize(maxAQI=max(AQI))
# # A tibble: 242,122 x 3
# # Groups:   stcofips [?]
#    stcofips dateL      maxAQI
#    <chr>    <date>      <dbl>
#  1 01003    2017-02-28     19
#  2 01003    2017-03-01     36
#  3 01003    2017-03-02     42
#  4 01003    2017-03-03     48
#  5 01003    2017-03-04     49
#  6 01003    2017-03-05     48
#  7 01003    2017-03-06     45
#  8 01003    2017-03-07     44
#  9 01003    2017-03-08     44
# 10 01003    2017-03-09     44
# # ... with 242,112 more rows

This represents the worst AQI recorded in the county for that day. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 represents good air quality with little potential to affect public health, while an AQI value over 300 represents hazardous air quality.

An AQI value of 100 generally corresponds to the national air quality standard for the pollutant (in our case, ozone), which is the level EPA has set to protect public health. AQI values below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered unhealthy, at first for certain sensitive groups of people, then for everyone as AQI values get higher.
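The AQI bands described above can be turned into labelled categories with cut(); this is a sketch using the EPA’s standard break points and category names:

```r
aqi <- c(35, 75, 120, 180, 250, 320)
cut(aqi,
    breaks = c(0, 50, 100, 150, 200, 300, 500),
    labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
               "Unhealthy", "Very Unhealthy", "Hazardous"))
# yields one EPA category per value, from Good through Hazardous
```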

county_summary_df <- group_by(ozone_airq_df2, stcofips, dateL) %>%
  summarize(maxAQI=max(AQI),
            State_Name=first(State_Name), # We can do this because we know that there is only one state corresponding to a stcofips, i.e. counties are embedded in the state.
            County_Name=first(County_Name), # We also need these variables to carry over to the next step.
            CBSA_Name=first(CBSA_Name)) %>%
  group_by(stcofips) %>%
  summarize(AQIgt100 = sum(maxAQI>=100), 
            numDays= n(), 
            percAQIgt100 = AQIgt100/numDays, 
            State_Name=first(State_Name), 
            County_Name=first(County_Name), 
            CBSA_Name=first(CBSA_Name)
            )

county_summary_df
# # A tibble: 786 x 7
#    stcofips AQIgt100 numDays percAQIgt100 State_Name County_Name CBSA_Name
#    <chr>       <int>   <int>        <dbl> <chr>      <chr>       <chr>    
#  1 01003           2     235      0.00851 Alabama    Baldwin     Daphne-F…
#  2 01033           0     246      0       Alabama    Colbert     Florence…
#  3 01049           0     359      0       Alabama    DeKalb      Fort Pay…
#  4 01051           0     230      0       Alabama    Elmore      Montgome…
#  5 01055           0     245      0       Alabama    Etowah      Gadsden,…
#  6 01069           0     245      0       Alabama    Houston     Dothan, …
#  7 01073           2     360      0.00556 Alabama    Jefferson   Birmingh…
#  8 01089           1     246      0.00407 Alabama    Madison     Huntsvil…
#  9 01097           3     241      0.0124  Alabama    Mobile      Mobile, …
# 10 01101           0     242      0       Alabama    Montgomery  Montgome…
# # ... with 776 more rows

Exercise

  • Create similar summaries by state and by CBSA separately
  • By state and CBSA create monthly summaries of percentage of days with AQI \(\ge\) 100.

Exploratory data visualisation

The first thing to do with any data is to explore it and identify issues, edge cases, outliers, interesting correlations, etc. Exploratory visualisation is an indispensable part of data analysis and is often overlooked. I cannot stress its importance enough.

“The simple graph has brought more information to the data analyst’s mind than any other device.”—John Tukey

We will use ggplot graphics to explore the datasets. See Hadley’s tutorial.

Basic plotting

First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function, where df is a dataframe that contains all features needed to make the plot.

You can add whatever aesthetics you want to your ggplot (inside the aes() argument), such as the X and Y axes, by specifying the respective variables from the dataset. Variables that should drive the color, size, shape and stroke can also be specified here. The aesthetics specified this way will be inherited by all the geom layers you add subsequently. Alternatively, you can specify the aesthetics for individual layers; I prefer the latter method, even though it is repetitive.

Finally, you can add labels to axes and other features.


g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06059") %>% # Orange County
  ggplot() + 
  geom_point(aes(x=dateL, y=AQI, color=SiteID)) +
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), method="loess")+
  scale_colour_brewer(palette = "Set2") + 
  labs(x = "Month", y = "Air Quality Index")

library(plotly)
ggplotly(g1)

There is also no reason to restrict ourselves to a cartesian coordinate system for the axes. Polar coordinates work just as well, especially when we have cycles. Though it is of not much use here, I will demonstrate coord_polar’s effect.


ozone_airq_df2 %>% 
  filter(stcofips == "06059") %>% # Orange County
  ggplot() + 
  coord_polar(theta = "x")+
  geom_point(aes(x=dateL, y=AQI, color=SiteID), alpha=.5, show.legend = FALSE) +
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), se=FALSE)+
  scale_colour_brewer(palette = "Dark2") + 
  labs(x = "Month", y = "Air Quality Index")

Facetting

Facetting allows us to plot subsets of the data together and place them next to one another. This is inspired by Edward Tufte’s small multiples. It allows us to see patterns in the data more clearly than we otherwise could. Normally, the axis scales on each graph are fixed, which means that they have the same size and range across the facets.

Let’s facet the data for various counties in California, while showing all the monitors within a county in a single graph.

 g1 <- ozone_airq_df2 %>% 
  filter(State_Name == "California") %>% 
  ggplot() + 
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), method="loess",  se=FALSE)+
  scale_colour_grey(end = 0)+
  facet_wrap(~stcofips)+
  labs(x = "Month", y = "Air Quality Index") + 
  theme_bw() + 
  theme(axis.text.x=element_blank(),
        legend.position="none")

 ggplotly(g1)

Exercise

  • Facet CA counties by CBSA (i.e. CBSA in rows and counties in columns) and produce the same graph as above
  • Using facetting as above, limit the x-axis to the summer months.

If we were to show this in a single graph, we could do something like the below. This graph shows how ozone is worse during the summer months than during the rest of the year.

g1 <- ozone_airq_df2 %>% 
  filter(State_Name == "California")  %>% 
  ggplot() + 
  geom_smooth(aes(x=dateL, y=AQI, group=SiteID, color=stcofips), method="loess", se=FALSE)+
  labs(x = "Month", y = "Air Quality Index") + 
  theme_bw() 

ggplotly(g1)

The display is unintuitive. Colors are a terrible way to convey information, even though we use them all the time, especially colors on a gradient representing categorical data such as counties. I much prefer the facetted plot.

Statistical summaries by group

g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06037") %>% 
  ggplot() + 
  geom_point(aes(x=Site_Num, y=AQI), show.legend = F)+
  stat_summary(aes(x=Site_Num, y=AQI), fun.y='median', colour = "red", size = 2)+
  stat_summary(aes(x=Site_Num, y=AQI),fun.y='mean', colour = "green", size = 3, shape=3)+
  labs(x = "Site", y = "Air Quality Index", title="Los Angeles County") + 
  theme_bw() 

ggplotly(g1)

A violin or box plot is better.


g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06037") %>% 
  ggplot() + 
  geom_violin(aes(x=Site_Num, y=AQI), show.legend = F)+
  stat_summary(aes(x=Site_Num, y=AQI), fun.y='median', colour = "red", size = 2)+
  stat_summary(aes(x=Site_Num, y=AQI),fun.y='mean', size = 3, shape=3)+
  labs(x = "Site", y = "Air Quality Index", title="Los Angeles County") + 
  theme_bw() 

ggplotly(g1)

Exercise

  • Add 0.25 and 0.75 quantiles as points to the above plot.

It would be useful to know which counties have the worst pollution. We can define worst pollution by the percent of days where the AQI from ozone is \(\ge\) 100. We can accomplish this with a bar plot showing only the 25 worst polluted counties.

g2 <- county_summary_df %>%
  top_n(25, percAQIgt100) %>%
  ggplot()+
  geom_col(aes(x=reorder(paste(County_Name, State_Name, sep=","), percAQIgt100), y=percAQIgt100, fill=State_Name)) + 
  scale_fill_brewer(palette = "Dark2", name="State") +
  coord_flip()+
  labs(y="Proportion of days with Ozone AQI greater than 100", x="County", title="Top 25 polluted counties")+
  theme_minimal()

ggplotly(g2, tooltip = c("y"))

Exercise

  • Use a different color palette than Dark2 in the above plot. Which one works better? Why?

Geographic mapping

To see exactly where these stations are, let us visualise the station locations using the leaflet library. One disadvantage of leaflet is that coordinate system projections are not fully supported; spatial data should ideally be in the WGS84 lat/long coordinate system. You can use the Proj4Leaflet plugin to display alternate projections, but that requires basemap tiles that work with your projection. Fortunately, our dataset is already in the WGS84 lat/long coordinate system.

library(leaflet)

monitorlocations_df <- unique(ozone_airq_df[,c('Longitude', 'Latitude')]) 

#Note the reversal of the traditional Lat/Long order: Longitude goes on the X-axis. While it does not matter here, since leaflet parses the columns correctly, you might as well develop good habits; many visualisation issues arise from mixing these up.

m <-  leaflet(monitorlocations_df) %>%
  addProviderTiles(providers$Stamen.TonerLines, group = "Basemap") %>%
   addProviderTiles(providers$Stamen.TonerLite, group = "Basemap") %>%
  addCircles(group = "Monitors")%>%
  addLayersControl(
    overlayGroups = c("Monitors", 'Basemap'),
    options = layersControlOptions(collapsed = FALSE)
      )

m

# Update the map showing the number of points at different zoom levels.
 m %>%
    addMarkers(clusterOptions = markerClusterOptions(), group = "Monitors")

Choropleth map

Let’s make a choropleth map. For this, we need polygonal boundaries, to attach data to those boundaries, and to color them by a value.

First off, some warnings that are worth repeating. Colors are terrible communicators; sizes are worse. Choropleth maps combine both of these bad characteristics (coloring areas). Avoid them if you can, but they are useful for some types of communication and are also pretty widespread.

To download the county boundary file, let’s use tigris, a convenience package for the US.

library(tigris)
library(rgdal)

ctys_spdf <- counties(cb=TRUE) #Only generalised boundaries are required
names(ctys_spdf@data)
# [1] "STATEFP"  "COUNTYFP" "COUNTYNS" "AFFGEOID" "GEOID"    "NAME"    
# [7] "LSAD"     "ALAND"    "AWATER"
nrow(ctys_spdf@data)
# [1] 3233
proj4string(ctys_spdf)
# [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
ctys_spdf <- merge(ctys_spdf, county_summary_df, by.x="GEOID", by.y='stcofips', all.x=FALSE) #Make sure to merge with spdf, not spdf@data. Otherwise, the order is jumbled.

nrow(ctys_spdf@data)
# [1] 785

# Because the presence of 0's create a problem for quantiles (some of the quantiles are the same) let's separate them out while calculating breaks.

Qpal <- colorQuantile(
  palette = "Reds", n = 5,
  domain = ctys_spdf$percAQIgt100[ctys_spdf$percAQIgt100>0]
)


labels <- sprintf(
  "County: %s <br/> AQI>100 days: <strong>%s</strong> %%",
  paste(ctys_spdf$County_Name, ctys_spdf$State_Name, sep=","),prettyNum(ctys_spdf$percAQIgt100*100, digits=2)
) %>% lapply(htmltools::HTML)

m <-  leaflet(ctys_spdf) %>%
  addProviderTiles(providers$Stamen.TonerLines, group = "Basemap") %>%
   addProviderTiles(providers$Stamen.TonerLite, group = "Basemap") %>%
       addPolygons(color = "#CBC7C6", weight = 1, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5,
             fillColor = Qpal(ctys_spdf$percAQIgt100),
              highlightOptions = highlightOptions(color = "green", weight = 2, bringToFront = TRUE),
             label = labels,
             labelOptions = labelOptions(
               style = list("font-weight" = "normal", padding = "3px 8px"),
               textsize = "15px",
               direction = "auto"),
             group = "Counties"
             )%>%
  addLayersControl(
    overlayGroups = c("Counties", 'Basemap'),
    options = layersControlOptions(collapsed = FALSE)
      ) 

  
  m 

Exercise

  • Add markers to the choropleth map.
  • Change the color scheme of the polygonal fills.
  • Change the line weight of the polygonal borders.

Conclusions

We covered a lot of ground in this tutorial. I do not expect all of it to make sense right away, nor that you will be proficient by the end of this tutorial. I constantly refer to old code to reuse it, make mistakes, correct them, and solve small problems.

The main point is not to use analysis for its own sake, but to illustrate a story. Exploratory data analysis will get you a long way in telling the story. You should spend a lot of time cleaning the data, understanding it, and figuring out the best way to communicate your insights.

Additional references

Always use the help commands. One of the strengths of R is its effective documentation, which includes examples and usage. Help on a particular command can be accessed with ?, e.g. ?group_by.

The following books are useful for reference.

  • Grolemund, Garrett and Hadley Wickham (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly Media. http://r4ds.had.co.nz/.

  • Venables, W.N. and Ripley, B.D. (2003). Modern Applied Statistics with S. New York, NY: Springer

In particular, Grolemund and Wickham is excellent for data munging operations, which usually take up a large portion of the analytical work.
